In this problem set we will use a small sample of data from the General Social Survey. The survey is designed to monitor changes in both social characteristics and attitudes. You will work with a sample from one neighborhood. The full neighborhood of ALL individuals is the population. For this problem set we do not know the true population parameters for any of the variables, because we do not have data on every person in the neighborhood.
First load the necessary packages:
# Recall that loading the tidyverse "umbrella" package loads ggplot2, dplyr, and
# readr all at once. Feel free to load these packages any way you choose.
library(tidyverse)
library(moderndive)
Next, load the data set from where it is stored on the web:
if(!dir.exists("./Data")){dir.create("./Data")}
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSypSoDCMH2N76Vo2dZRPkw2q3t1mbvAXlOtgPDIsHg4NclAQFmER-BdvXH9_lrT40UQCVdPXOi_NMJ/pub?gid=257689625&single=true&output=csv"
if(!file.exists("./Data/gss_sample.csv")){ download.file(url, destfile = "./Data/gss_sample.csv")}
gss_sample <- read_csv("./Data/gss_sample.csv")
DT::datatable(gss_sample, rownames = FALSE)
Be sure to take a look at the data in Figure ??. Each row in the data set is a person that was surveyed (100 rows or cases in total). The variables in the data set include each respondent’s age, race, and number of hours of TV watched a day tvhours.
Setting a seed: We will take some random samples and build sampling distributions in this lab. In order to make sure R takes the same random sample every time you run your code, you can do what is called “setting a seed”. Do this in any code chunk that you take a random sample!
You can set a seed like so. Any number will do. (You do not need to run this right now…just showing you how)
set.seed(45)
The following code tells R to take 1000 bootstrap resamples from the gss_sample data. You can set the seed to whatever value you like!
set.seed(42)
boot_samp_1000 <- gss_sample %>%
rep_sample_n(size = nrow(gss_sample), reps = 1000, replace = TRUE)
Note a few important details about the rep_sample_n function, and bootstrap sampling in general:
size = nrow(gss_sample) tells R that each bootstrap resample we take has 100 cases… the size of the original sample.reps = 1000 tells R to take 1000 bootstrap resamples (each of size 100).replace = TRUE argument tells R that in each bootstrap resample, we can include a row from gss_sample multiple times. So if for instance, respondent # 12 is the first random resample taken here, respondent 12 is still available to be resampled again at random. Thus, some people may appear multiple times in our bootstrap resample, and some people from the original data set may not appear at all.boot_samp_1000.Take a look at the boot_samp_1000 data frame we just generated in Figure 2.1. Note that the replicate column labels each bootstrap resample (the first 100 rows are labeled 1, the next 100 rows are labeled 2, etc.)
DT::datatable((boot_samp_1000))
Figure 2.1: Bootstrap Distribution
boot_samp_1000 have? Why?Type your complete sentence answer here using inline R code and delete this comment.
Let’s say we want to use the bootstrap resample that we just generated to calculate a confidence interval for the population mean \(\mu_{tv}\) of tvhours. To do so, we need to know the sample mean \(\bar{x}\) of tvhours for each of the 1,000 bootstrap resamples. In this case, the sample mean \(\bar{x}\) of tvhours for each bootstrap resample is our BOOTSTRAP STATISTIC. We can calculate that with three lines of code as follows:
boot_distrib_tv <- boot_samp_1000 %>%
group_by(replicate) %>%
summarize(stat = mean(tvhours))
# Viewing the data
head(boot_distrib_tv)
| replicate | stat |
|---|---|
| 1 | 2.72 |
| 2 | 3.45 |
| 3 | 2.96 |
| 4 | 2.80 |
| 5 | 3.16 |
| 6 | 2.77 |
Note that:
group_by() argument tells R to take the sample mean of tvhours separately for each different replicate in the bootstrap resample.statThis is the bootstrap distribution for the mean of
tvhours!
# or using infer
library(infer)
boot_dist_tv_infer <- gss_sample %>%
specify(response = tvhours) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
# using a for loop
B <- 1000
bs_mean <- numeric(B)
for(i in 1:B){
bss <- sample(gss_sample$tvhours, size = sum(!is.na(gss_sample$tvhours)), replace = TRUE)
bs_mean[i] <- mean(bss)
}
Take a look at the boot_distrib_tv we just created in RStudio’s data viewer.
stat are there in the object boot_distrib_tv? Please explain why there are this many values of the bootstrap statistic.Type your complete sentence answer here using inline R code and delete this comment.
The bootstrap distribution is shown in Figure 2.2 . This is a histogram of the stat values from boot_distrib_tv.
ggplot(data = boot_distrib_tv, aes(x = stat)) +
geom_histogram(color = "white", binwidth = 0.25) +
labs(title = "Bootstrap distribution",
x = "boostrap statistic (mean tvhours)")
Figure 2.2: Bootstrap distribution of mean TV hours
# Or using the infer pipeline
visualize(boot_dist_tv_infer, bins = 9)
Figure 2.3: Bootstrap distribution computed with visualize()
hist(bs_mean, breaks = "Scott", main = "Bootstrap Distribution",
xlab = expression(paste(bar(x),"*")), col = "lightblue")
Figure 2.4: Bootstrap distribution of mean TV hours (base R histogram)
We can now use the bootstrap distribution for the sample mean tvhours \(\bar{x}\) to calculate a 95% confidence interval for the population mean tvhours \(\mu_{tv}\), using the “95% rule for bell shaped distributions”, which states that the middle 95% of values of a bell/normal shaped distribution are between
\[\text{mean} \pm 1.96 \cdot SD\]
We can thus apply the 95% rule, like so:
# Note that z_{0.975} = 1.96
qnorm(0.975)
[1] 1.959964
(xbar <- mean(gss_sample$tvhours))
[1] 3.14
boot_dist_tv_infer %>%
summarize(se = sd(stat),
lower_ci = xbar - (qnorm(0.975) * se),
upper_ci = xbar + (qnorm(0.975) * se)) -> bnci_tv
bnci_tv
| se | lower_ci | upper_ci |
|---|---|---|
| 0.3460207 | 2.461812 | 3.818188 |
#
standard_error_ci <- boot_dist_tv_infer %>%
get_confidence_interval(type = "se", point_estimate = xbar, level = 0.95)
standard_error_ci
| lower_ci | upper_ci |
|---|---|
| 2.461812 | 3.818188 |
You can also calculate a 95% confidence interval using the percentile method. The logic goes like this:
Since our bootstrap resample had 1000 values of stat:
stat values fall inside this 95% confidence interval, i.e. 95%totaling 100%. We can use the quantiles of the bootstrap distribution to find these values as follows:
bpci_tv <- boot_distrib_tv %>%
summarize(lower_ci = quantile(stat, 0.025),
upper_ci = quantile(stat, 0.975))
bpci_tv
| lower_ci | upper_ci |
|---|---|
| 2.51 | 3.89 |
# Or using infer
boot_distrib_tv %>%
get_confidence_interval(type = "percentile", level = 0.95)
| lower_ci | upper_ci |
|---|---|
| 2.51 | 3.89 |
# Which is really just doing the following:
PCI <- boot_distrib_tv %>%
summarize(lower_ci = quantile(stat, 0.025),
upper_ci = quantile(stat, 0.975))
PCI
| lower_ci | upper_ci |
|---|---|
| 2.51 | 3.89 |
This method
stat fall (or 25 cases in this example… 25/1000 = 0.025)stat fall (or 25 cases in this example 975/1000 = 0.975)Based on these results, we are 95% confident that the true mean hours of TV watched \(\mu_{tv}\) in the population is between the lower (2.51 hours) and upper (3.89 hours) CI endpoints we just calculated.
The bootstrap distribution and the 95% bootstrap percentile confidence intervals we just calculated are shown in the figure below. This is a histogram of the stat values from boot_distrib_tv. The green line is the lower bound of the 95% CI, and the blue line is the upper bound. 950 of the 1000 bootstrap resamples had a mean for tvhours that fell between the green and blue dashed lines…25 of the samples had a mean above the blue dashed line, and 25 of the samples had a mean below the green dashed line.
ggplot(data = boot_distrib_tv, aes(x = stat)) +
geom_histogram(color = "black", fill = "pink", binwidth = 0.15) +
labs(title = "Bootstrap distribution with 95% CI",
x = "boostrap statistic (mean tvhours)") +
geom_vline(data = bpci_tv, aes(xintercept = lower_ci), color = "green", lwd = 1, lty = "dashed") +
geom_vline(data = bpci_tv, aes(xintercept = upper_ci), color = "blue", lwd = 1, lty = "dashed") +
theme_bw()
boot_dist_tv_infer %>% visualize() +
shade_confidence_interval(endpoints = PCI, color = "red", fill = "pink")
tvhours using this same bootstrap resample and the percentile method, roughly how many of the 1000 values of tv_mean would fall between the lower_ci and the upper_ci?Type your complete sentence answer here using inline R code and delete this comment.
# Code to verify here
tvhours generated above (boot_distrib_tv) and the bootstrap percentile method to calculate a 99% bootstrap percentile confidence interval for the mean tvhours. Round your answer to two decimal places. Make sure to use inline R code to report your answer and include appropriate units with the confidence interval.# Type your code and comments inside the code chunk
Type your complete sentence answer here using inline R code and delete this comment.
tvhours \(\mu_{tv}\)? Why?Type your complete sentence answer here using inline R code and delete this comment.
boot_samp_1000), to generate a bootstrap distribution for the sample mean respondent age \(\bar{x}\) instead of tvhours. Please be sure to name it something different than the bootstrap distribution for the sample mean of tvhours# Type your code and comments inside the code chunk
# Using infer
age \(\mu_{age}\) using the 95% rule method.# Type your code and comments inside the code chunk
Type your complete sentence answer here using inline R code and delete this comment. Round your answers to two decimal places.
age \(\mu_{age}\).# Type your code and comments inside the code chunk
Type your complete sentence answer here using inline R code and delete this comment.
Type your complete sentence answer here using inline R code and delete this comment.
age and the percentile method to calculate an 80% confidence interval for the population mean respondent age \(\mu_{age}\).# Type your code and comments inside the code chunk
Type your complete sentence answer here using inline R code and delete this comment.
The procedure for generating a bootstrap sampling distribution is VERY similar for categorical data. As an example we will generate a bootstrap sampling distribution for the proportion of respondents that identified as a Person of Color.
We already did this above! We can use the same boot_samp_1000 as before.
boot_distrib_POC <- boot_samp_1000 %>%
group_by(replicate) %>%
summarize(n = n(),
POC_count = sum(race == "POC"),
boot_stat = POC_count/n,
phat_boot = mean(race == "POC"))
head(boot_distrib_POC)
| replicate | n | POC_count | boot_stat | phat_boot |
|---|---|---|---|---|
| 1 | 100 | 26 | 0.26 | 0.26 |
| 2 | 100 | 24 | 0.24 | 0.24 |
| 3 | 100 | 25 | 0.25 | 0.25 |
| 4 | 100 | 16 | 0.16 | 0.16 |
| 5 | 100 | 28 | 0.28 | 0.28 |
| 6 | 100 | 22 | 0.22 | 0.22 |
Note that with a categorical variable, the code differs in two important respects now:
race == "POC" in each replicate.The following will calculate the 95% confidence interval for the proportion of people that identified as POC using the 95% rule.
phat <- mean(gss_sample$race=="POC")
boot_distrib_POC %>%
summarize(se = sd(boot_stat),
lower_ci = phat - (qnorm(0.975) * se),
upper_ci = phat + (qnorm(0.975) * se))
| se | lower_ci | upper_ci |
|---|---|---|
| 0.0421354 | 0.1574161 | 0.3225839 |
The following will calculate the 95% confidence interval for the proportion of people that identified as “POC” using the percentile method.
boot_distrib_POC %>%
summarize(lower_ci = quantile(boot_stat, 0.025),
upper_ci = quantile(boot_stat, 0.975))
| lower_ci | upper_ci |
|---|---|
| 0.16 | 0.33 |
White.# Type your code and comments inside the code chunk
# Type your code and comments inside the code chunk
# 95% rule
# Type your code and comments inside the code chunk
Type your complete sentence answer here using inline R code and delete this comment.
As described in moderndive chapter 8.7.2, not only can we generate confidence intervals using a computer/resampling as we’ve been doing until now, in many cases there also exists a mathematical formula! This however necessitates a little mathematical/probability theory; a topic we leave to a more advanced statistics class.
To generate a \((1 - \alpha)\cdot 100\)% confidence interval based on the theoretical normal distribution, we can use the following formula:
\[\widehat{\text{point estimate}} \pm z_{1 - \frac{\alpha}{2}} \cdot \widehat{SE}\]
So, for instance if we wanted to calculate the 95% confidence interval for the population mean of tvhours \(\mu_{tv}\) that respondents watched based on our sample:
tvhours from the sample: \(\bar{x}\)\[\widehat{SE} \approx \frac{s}{\sqrt{n}}\]
and thus a 95% CI \(\rightarrow \alpha = 0.05 \rightarrow z_{1 - \frac{0.05}{2}} \rightarrow z_{0.975} = 1.959964\) would be
\[ \bar{x} \pm 1.96 \cdot \widehat{SE} = \bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{n}} \]
We can perform the calculations in R as follows:
x_bar = mean(gss_sample$tvhours)
gss_sample %>%
summarize(sd = sd(tvhours),
n = n(),
se = sd/sqrt(n),
lower_ci = x_bar - qnorm(.975) * se,
upper_ci = x_bar + qnorm(.975) * se) -> tci_tv
tci_tv
| sd | n | se | lower_ci | upper_ci |
|---|---|---|---|---|
| 3.592979 | 100 | 0.3592979 | 2.435789 | 3.844211 |
tvhours \(\mu_{tv}\) you’ve computed in this problem set. Do this by replacing
X, Y, A, B, P, and Q with the appropriate values you’ve computed.When you are done, make sure all the | in the table still line up so your results print out in a table!
| CI construction method | lower value | upper value |
|---|---|---|
| Using boostrap: 95% rule | X | Y |
| Using boostrap: percentile rule | A | B |
| Using mathematical formula | P | Q |
Type your complete sentence answer here using inline R code and delete this comment.